
Use a ring channel to avoid blocking write of events #2082

Merged

aledbf merged 2 commits into kubernetes:master from the ring branch on Feb 14, 2018

Conversation

@aledbf (Member) commented Feb 13, 2018

Which issue this PR fixes:

fixes #2022
closes #2081

@k8s-ci-robot added the "cncf-cla: yes" label (Indicates the PR's author has signed the CNCF CLA) on Feb 13, 2018
@k8s-reviewable: This change is Reviewable

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these OWNERS files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the "approved" (Indicates a PR has been approved by an approver from all required OWNERS files) and "size/M" (Denotes a PR that changes 30-99 lines, ignoring generated files) labels on Feb 13, 2018
@@ -106,7 +107,7 @@ func NewNGINXController(config *Configuration, fs file.Filesystem) *NGINXControl
 		}),

 		stopCh:   make(chan struct{}),
-		updateCh: make(chan store.Event, 1024),
+		updateCh: channels.NewRingChannel(4096),

does this specify a limit of 4096? If so, the initial load does not appear to pop items off the ring until the cache has been initialized. This still might cause a problem. Can we lower to 1024 to confirm?

@aledbf (Member, Author):

yes, let me lower that value and create a different docker image

@azweb76 Feb 13, 2018

cool. still works, my guess is the ring can expand beyond 1024.

@aledbf (Member, Author) commented Feb 13, 2018

@azweb76 please use quay.io/aledbf/nginx-ingress-controller:0.326 (new value is 1024)

@azweb76 commented Feb 13, 2018

Correct me if I'm wrong: on startup all events are written to the ring, and only later popped off the ring.
https://github.com/kubernetes/ingress-nginx/blob/master/internal/ingress/controller/nginx.go#L246-L273

Given this, only a maximum of 1024 events will be tracked in sync. If I understand correctly, the ring is initialized as 1024/4096/etc. and if the cluster exceeds this, the first N items are dropped before they are processed.
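For readers following the thread, here is a minimal, self-contained sketch of the drop-oldest behaviour being described, using the eapache/channels package referenced in the diff (the tiny capacity of 4 and the integer "events" are just for the demo, not the controller's actual setup). Writes to In() never block; once the ring is full, the oldest entries are discarded:

```go
package main

import (
	"fmt"

	"github.com/eapache/channels"
)

func main() {
	// Same idea as updateCh in the diff, but with a tiny capacity for the demo.
	ring := channels.NewRingChannel(4)

	// Produce more events than the ring can hold. None of these sends block,
	// unlike a plain buffered channel, which would stall the producer here.
	for i := 0; i < 10; i++ {
		ring.In() <- i
	}
	ring.Close() // flush whatever is still buffered to Out(), then close it

	// Only the newest events survive; the earliest ones were silently dropped.
	for v := range ring.Out() {
		fmt.Println("received:", v)
	}
}
```

With nothing reading Out() while the producer runs, the demo prints only the last few values, which matches the "first N items are dropped before they are processed" description above.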

@azweb76 commented Feb 13, 2018

Seems like we need to start processing the channel ring before we append to it so we can treat it like a buffer on initial startup.

@aledbf (Member, Author) commented Feb 13, 2018

> Given this, only a maximum of 1024 events will be tracked in sync. If I understand correctly, the ring is initialized as 1024/4096/etc. and if the cluster exceeds this, the first N items are dropped before they are processed.

Yes. That does not mean we are not handling the events. Keep in mind we use a work queue for the update of the configuration, and we discard all the events received in a time window (all the events that arrive between the time we start and finish an update).

@aledbf (Member, Author) commented Feb 13, 2018

> Seems like we need to start processing the channel ring before we append to it so we can treat it like a buffer on initial startup.

This is an example of the chicken-and-egg problem: without the store we don't have events, and without the syncQueue we cannot start to process updates, but the syncQueue depends on the store.

Using the ring channel is the right solution because it allows us to discard some of the initial events.

@azweb76 commented Feb 13, 2018

Sure, after it has started. But this loop is what processes those items, and since the loop doesn't start until all initial events are loaded, only up to the updateCh limit is processed.

https://github.com/kubernetes/ingress-nginx/blob/master/internal/ingress/controller/nginx.go#L302
https://github.com/eapache/channels/blob/master/ring_channel.go#L92
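A hedged, self-contained paraphrase of the consumer pattern those links point at (event, stopCh, and the enqueue call are simplified stand-ins, not the controller's actual types): nothing is popped off the ring until a loop like this is running, which is exactly the startup window being discussed.

```go
package main

import (
	"fmt"

	"github.com/eapache/channels"
)

type event struct{ obj string } // simplified stand-in for store.Event

func main() {
	updateCh := channels.NewRingChannel(1024)
	stopCh := make(chan struct{})

	// Producer: informer callbacks would push events here; writes never block.
	go func() {
		for i := 0; i < 5; i++ {
			updateCh.In() <- event{obj: fmt.Sprintf("ingress-%d", i)}
		}
		updateCh.Close() // flush any remaining events to Out(), then close it
	}()

	// Consumer: drain Out(), type-assert back to the event type, and hand the
	// object to the sync queue (here just a print) until told to stop.
	for {
		select {
		case raw, ok := <-updateCh.Out():
			if !ok {
				return // ring drained and closed
			}
			if evt, ok := raw.(event); ok {
				fmt.Println("enqueue for sync:", evt.obj)
			}
		case <-stopCh:
			return
		}
	}
}
```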

@k8s-ci-robot added the "size/XXL" label (Denotes a PR that changes 1000+ lines, ignoring generated files) and removed the "size/M" label on Feb 13, 2018
@aledbf (Member, Author) commented Feb 13, 2018

@azweb76 what are you proposing exactly?

@azweb76 commented Feb 13, 2018

I greatly appreciate the effort you're putting into this fix. I just want to make sure we're not introducing another bug. You certainly know more about this than I do, so I will be quiet now. Thanks again!

@aledbf (Member, Author) commented Feb 13, 2018

> I greatly appreciate the effort you're putting into this fix

The update channel was introduced because of the store refactor. This is something we needed because it was almost impossible to write tests for the informers and all the logic behind the sync process.
So I have to fix this; it's not optional :)

@aledbf (Member, Author) commented Feb 13, 2018

The issue in #2022 is related to the number of events we receive at startup, which causes write contention on the channel (we exceed the buffer size). The change introduced here allows us to discard events when we exceed the defined size. As I mentioned previously, we don't need all the events (from the start), but at the same time we cannot run any filter at this point because of the lack of context.

If you run the test image and increase the log level to 3 using the flag --v=3, you should see lots of "skipping %v sync (%v > %v)" messages. That means those events in the work queue were received in the middle of an update and do not really need to be processed (this avoids unnecessary reloads).
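A hedged sketch of the coalescing behind that log message (the names here are hypothetical, not the actual task queue implementation): each queued item carries the time it was enqueued, and the worker skips items enqueued before the last sync started, because that sync already picked up their changes.

```go
package main

import (
	"fmt"
	"time"
)

// task is a hypothetical stand-in for an item in the sync work queue.
type task struct {
	key string
	ts  time.Time // when the triggering event was enqueued
}

func main() {
	var lastSync time.Time

	process := func(t task) {
		if t.ts.Before(lastSync) {
			// Roughly what the "skipping %v sync (%v > %v)" message reports:
			// the event arrived while a newer sync was already in progress.
			fmt.Printf("skipping %s sync\n", t.key)
			return
		}
		fmt.Printf("syncing %s\n", t.key)
		time.Sleep(10 * time.Millisecond) // stand-in for the actual reload work
		lastSync = time.Now()             // items queued before this point are now stale
	}

	// Two events queued back to back: the first triggers a sync, the second is
	// already covered by that sync and is skipped instead of causing a reload.
	queue := []task{
		{key: "ingress-a", ts: time.Now()},
		{key: "ingress-b", ts: time.Now()},
	}
	for _, t := range queue {
		process(t)
	}
}
```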

@azweb76 commented Feb 13, 2018

awesome. Looking forward to the release. 😄

@aledbf merged commit 9bcb5b0 into kubernetes:master on Feb 14, 2018
@aledbf deleted the ring branch on February 15, 2018 04:11